
Analysis of Stock Data

Final Project of the Udacity Nanodegree Data Scientist Program

Table of Contents

Motivation & Outline

For the final project of the Udacity Nanodegree Data Scientist Program I needed to decide what kind of data to work on, applying the skills I had acquired.

Since I did not want to put too much effort into analyzing a data set that I wouldn't need in the near future even after completing the Nanodegree (whether at home or at work), I chose financial data. A year ago, after my first Udacity course (Data Analyst), I wrote a program that scrapes, transforms, visualizes and stores my personal financial data from PDF documents (statements of earnings, bank account statements and insurance data) in order to monitor its development over time, identify trends and keep track of the variety of information. For obvious confidentiality reasons, that data is not part of the current project.

Around the same time I started investing in stocks, funds and ETFs for the first time in my life (which seemingly everybody did during the Covid-19 pandemic). Knowing full well that I am not going to be the next Warren Buffett, I still enjoy studying a field that I hadn't entered before and that will surely affect me for the rest of my life. This is exactly why I want to use financial stock data for the underlying data science project with Python. So far, timeseries data hadn't been the focus of the classes, so I additionally needed to do some research (among others with the help of the free Udacity courses "Timeseries Forecasting" and "Machine Learning for Trading").

The project requirements/steps are the following:

Project Idea/Plan:

Disclaimer

Please note that all insights, data, findings and predictions that I make throughout the project must not be used as a basis for trading stocks in any way! There might be mistakes, incomplete analyses and biased conclusions. No guarantees!

Problem Definition

As I mentioned before, I currently hold stocks and ETFs and plan to eventually acquire more in the future (without risking much). That being said, the following questions pop up:

  1. How well are the stocks performing compared to the past and/or to each other?
    • How, or with which features, can I quantify the performance?
  2. How well will the stocks perform in the near future and can their performance be predicted?
    • Which indicators can foreshadow future price movements and where are the limits of ML algorithms?

All of the above questions feed into one further question: "In which stocks or markets should I invest in the future?"

Gathering and Wrangling the Stock Data

The majority of the stated questions/tasks can be assessed, by way of example, using the historical data and metadata of one single stock. However, I plan to work with functions so that the upcoming algorithms can be applied to almost any stock or market.

Being aware that there have already been masses of similar projects and studies, I will try to use as many useful existing APIs, modules and strategies as possible to get everything to work and save brain capacity (feel free to check out the credits and sources at the end of the notebook).

I tried three common (and mostly free) Python APIs for gathering historical stock data. Quandl, for example, seems to be well known, but I had some issues navigating their databases and finding the stock data I wanted. YFinance (Yahoo Finance) comes in quite handy, but I nevertheless decided to use Alpha Vantage (limited to 5 requests per minute and 500 per day), which delivers a lot of data in a comfortable way. Note that you have to get a free API key first (the same goes for Quandl).

About the time intervals: having a regular job, I can't react to price movements within hours or even minutes, which is why I am totally fine with historical data on a daily basis.

Available data

As described above, there are several (free) sources of stock data, and I will mostly use the Alpha Vantage API, which theoretically provides the following data:

Unfortunately, there are still lots of stocks for which not all data is present or up to date (especially stocks outside the US markets), but at least the historical data is often available. The following code contains my functions to quickly gather the described data where available. For the project revision I will provide the data as CSV files, since I am not going to hand over my API keys.
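A minimal sketch of such a gathering function, assuming the Alpha Vantage TIME_SERIES_DAILY_ADJUSTED endpoint and its JSON field names (the prefixed column keys like "5. adjusted close"); the function and variable names are my own illustration, not the notebook's actual code, and the request-building part is kept separate from the parsing so the latter can run without network access:

```python
import pandas as pd

def build_request_params(symbol: str, api_key: str) -> dict:
    """Query parameters for Alpha Vantage's TIME_SERIES_DAILY_ADJUSTED endpoint
    (sent to https://www.alphavantage.co/query via e.g. requests.get)."""
    return {
        "function": "TIME_SERIES_DAILY_ADJUSTED",
        "symbol": symbol,
        "outputsize": "full",
        "apikey": api_key,
    }

def payload_to_dataframe(payload: dict) -> pd.DataFrame:
    """Turn the JSON payload into a chronologically sorted DataFrame."""
    raw = payload["Time Series (Daily)"]
    df = pd.DataFrame.from_dict(raw, orient="index").astype(float)
    # Strip the numeric prefixes, e.g. "5. adjusted close" -> "adjusted close"
    df.columns = [c.split(". ", 1)[1] for c in df.columns]
    df.index = pd.to_datetime(df.index)
    return df.sort_index()

# Minimal fake payload mimicking the response structure (values are made up)
sample = {"Time Series (Daily)": {
    "2021-01-05": {"1. open": "72.0", "4. close": "72.5", "5. adjusted close": "70.7"},
    "2021-01-04": {"1. open": "72.3", "4. close": "71.9", "5. adjusted close": "70.1"},
}}
hist = payload_to_dataframe(sample)
print(hist["adjusted close"].iloc[-1])  # 70.7 (most recent day after sorting)
```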

For now, the BMW stock will serve as an example, even though the code is intended to work for all stocks available at Alpha Vantage.

Unfortunately there is no company overview data for the symbol "BMW.DE"

Unfortunately there is no earnings calendar data for the symbol "BMW.DE"

Fortunately, when using historical stock data from the Alpha Vantage API, not much data wrangling is necessary. There would be a lot more work if the plan were to scrape fundamental data for each timestep in the past in order to improve the analysis. This won't be part of this notebook, though.

Difference between Close Values and Adjusted Close Values

When I started comparing the stock data from Alpha Vantage to the diagrams I found on different broker websites, I was irritated by differences in value, which I later found out to result from the distinction between adjusted and unadjusted close values. The adjusted close values are calculated with respect to dividends, splits and new offerings. Since the historical OHLC (Open, High, Low, Close) data relates to the unadjusted close values, I will use the adjusted value only when there is no interaction with the open, high, low or volume data. For stocks with splits in their historical data, I might need a more detailed approach to avoid bias, for example in a machine learning model. By the way: the YFinance API doesn't provide adjusted close values, which is one of my reasons to use Alpha Vantage where possible.
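One common way to make the OHLC columns consistent with the adjusted closes is to back-adjust them by the per-day ratio of adjusted to unadjusted close; this is a sketch of that idea (column names follow the close_adj convention used in this notebook, the function name is my own):

```python
import pandas as pd

def adjust_ohlc(df: pd.DataFrame) -> pd.DataFrame:
    """Scale open/high/low by the per-day ratio of adjusted close to close,
    so that all OHLC columns live on the adjusted price scale."""
    factor = df["close_adj"] / df["close"]
    out = df.copy()
    for col in ["open", "high", "low"]:
        out[col] = df[col] * factor
    out["close"] = df["close_adj"]
    return out

hist = pd.DataFrame({
    "open":      [100.0, 102.0],
    "high":      [103.0, 104.0],
    "low":       [ 99.0, 101.0],
    "close":     [102.0, 103.0],
    "close_adj": [ 51.0, 103.0],   # e.g. a 2:1 split between the two days
})
adj = adjust_ohlc(hist)
print(adj["open"].tolist())  # [50.0, 102.0]
```

With this transformation, indicators that mix open/high/low with close values (candlestick patterns, for instance) stay internally consistent even across splits.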

Estimate Stock Performance/Trend/Price

Of course it would be naive to think I could easily predict future stock movements with the basic knowledge that I have, but still... you have to start somewhere, right? Let's not rush things by trying to predict prices for the next month, but rather investigate the general ideas needed to understand and estimate price behaviour, then put the criteria into some scalable measures, and finally try to at least predict an up or down trend a few days into the future.

Usually the analysis of a stock is divided into fundamental analysis and technical analysis. For a quick understanding of the underlying differences, I found the following page helpful:

Fundamental Analysis

Fundamental analysis can be understood as "looking at aspects of a company in order to estimate its value". This can be an analysis of the company's general condition, its decisions and communications, but it can also be the analysis of "outside influences" like the current pandemic, or even tweets of famous people about the company or its market. Fundamental analysis can be a matter of politics, natural catastrophes and more, and is especially used for long-term investments. Since the focus of this project is rather on short-term decision making, I will not discuss fundamental analysis any further, but rather turn to the analysis more interesting for a programmatic approach: technical analysis.

Technical Analysis

Technical analysis deals with quantifying a company's stock performance, usually on a short- or mid-term timespan. There is a huge number of so-called indicators, calculated from available stock data, that traders use to estimate current price movements. Indicators can also be visual patterns in financial charts, like those in a "candlestick diagram", or other strategies like seasonal decomposition (although I think this technique is better suited to less volatile timeseries, such as yearly sales). Since this project is more about the programmatic approach and less about the financial background, I will introduce just a few indicators for basic use cases.

All upcoming features/indicators will be stored in a copy of the stockHist data frame called "stockHist_comp" (comp = "complex").

Seasonality and Trend

With the help of the seasonal_decompose function from statsmodels, we can determine trend and seasonal patterns in the historical data. Assuming that we need a multiplicative decomposition, the components look as in the following charts (the period is set to 252 working days, approximately one year of data). The trend, as its name indicates, shows the general movement of the stock price, whilst the seasonal component shows a cyclic pattern of price movements recurring each year within the chosen period. The residual component basically reflects uncertainty, stock volatility and unforeseeable changes in price (e.g. the Covid-19 impact at the beginning of 2020).
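The idea behind the multiplicative decomposition can be sketched by hand in a few lines of pandas (a simplified version of what seasonal_decompose does internally; the synthetic series, the small period of 12 instead of the notebook's 252, and all names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

PERIOD = 12  # short period for the demo; the notebook uses 252 trading days

rng = np.random.default_rng(0)
n = 5 * PERIOD
t = np.arange(n)
# Synthetic multiplicative series: trend * seasonal * noise
series = pd.Series((100 + t)
                   * (1 + 0.1 * np.sin(2 * np.pi * t / PERIOD))
                   * rng.normal(1, 0.01, n))

# Trend: centered moving average over one full period
trend = series.rolling(PERIOD, center=True).mean()
# Seasonal: average of the detrended series per position within the period
detrended = series / trend
seasonal = detrended.groupby(t % PERIOD).transform("mean")
# Residual: whatever trend and seasonality do not explain
residual = series / (trend * seasonal)

print(round(residual.dropna().mean(), 2))  # ≈ 1.0 for a multiplicative model
```

A residual hovering around 1 (rather than 0) is the signature of the multiplicative model: the series is reconstructed as trend × seasonal × residual.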

If you hide the residual by clicking on it, you will see the seasonal pattern that seems to occur each year. This could be provoked by the communication of quarterly reports (orange dashed lines in the next chart), seasonal client buying behaviour, the distribution of dividends (green dashed lines in the next chart) or the possibility to buy "preference shares" (for BMW, at the beginning of November).

The comparison of the residuals over the years clearly shows the impact of the start of the pandemic in March 2020, when all markets decreased significantly.

The light blue trace (seasonality times the mean of this year's close_adj) shows tendencies quite similar to the actual price movement (blue). Disregarding the unknown trend, the future close_adj could look like the red line, taking ONLY seasonality into account. However, as seen in the previous charts, the residual component mostly outweighs the seasonal component. Maybe there are other sectors/stocks with stronger seasonal or cyclic patterns.

Add common technical analysis features / indicators

A significant light green area on top of close_adj suggests selling, while light green areas beneath close_adj suggest buying.

Crossings of the Bollinger Bands (BBperc would be negative or above 1) could indicate an imminent movement back to the mean (SMA). A sell or buy recommendation would be triggered upon re-entry into the Bollinger Band. The RSI is mostly used in combination with upper and lower limits, such as 70%/30%, indicating that the stock is overbought/oversold.
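These indicators can be computed with plain pandas; the following is a hedged sketch (the classic RSI uses Wilder's smoothing, whereas this version uses a simple rolling mean, and the function and column names are my own illustration, not the notebook's exact code):

```python
import numpy as np
import pandas as pd

def add_indicators(close: pd.Series, window: int = 20) -> pd.DataFrame:
    """SMA, Bollinger Bands, %B and a simplified (rolling-mean) RSI."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    upper = sma + 2 * std
    lower = sma - 2 * std
    # %B: position of the price inside the bands; <0 or >1 means outside
    bb_perc = (close - lower) / (upper - lower)

    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rsi = 100 - 100 / (1 + gain / loss)

    return pd.DataFrame({"close": close, "SMA": sma, "BB_upper": upper,
                         "BB_lower": lower, "BBperc": bb_perc, "RSI": rsi})

rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 1, 120).cumsum())  # synthetic price path
ind = add_indicators(close)
print(ind[["BBperc", "RSI"]].dropna().tail(1).round(2))
```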

Create own features / indicators (experimental!)

Add future close_adj prices as well as the return for each timestep
Add a criterion that detects crossings of the Bollinger Band

Final feature: Buy, sell or keep recommendation
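A hypothetical rule set for such a recommendation feature, combining the band re-entry criterion with the RSI limits described above (the thresholds and names are my own assumptions, not the notebook's final definition):

```python
import pandas as pd

def recommend(bb_perc: pd.Series, rsi: pd.Series) -> pd.Series:
    """'buy' when BBperc re-enters the band from below (previous < 0, now >= 0)
    while the RSI signals oversold territory; 'sell' symmetrically on re-entry
    from above with an overbought RSI; otherwise 'keep'."""
    prev = bb_perc.shift(1)
    buy = (prev < 0) & (bb_perc >= 0) & (rsi < 30)
    sell = (prev > 1) & (bb_perc <= 1) & (rsi > 70)
    rec = pd.Series("keep", index=bb_perc.index)
    rec[buy] = "buy"
    rec[sell] = "sell"
    return rec

bb = pd.Series([-0.1, 0.2, 0.5, 1.1, 0.9])   # made-up %B values
rsi = pd.Series([25, 28, 50, 75, 72])        # made-up RSI values
print(recommend(bb, rsi).tolist())  # ['keep', 'buy', 'keep', 'keep', 'sell']
```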

Machine Learning for Trend/Price/Return Prediction and Buy/Keep/Sell Recommendation

My expectation when I started the project was that there must be two or three commonly used techniques, but the deeper I dive into the fields of timeseries forecasting and machine learning for trading, the more I understand that I have opened Pandora's box. To clear my mind, I will try to cluster the main strategies and algorithms I found (by far not all) to be used for the given task(s).

After the research, I decided to start with an application of an LSTM recurrent neural network, which seems to be a popular approach for timeseries forecasting.

Thoughts on Possible Strategies

Training a model with multivariate input

Understanding the structure of TimeseriesGenerator-Elements

The TensorFlow Keras module offers a function called TimeseriesGenerator that divides past timeseries data into chunks/windows with separate input arrays and target arrays. In order to understand its parameters, I tried to visualize them in the following diagram, hoping it would serve my purpose of prediction. Every color in the plot stands for a single window of the whole timeseries that will be used during training to predict one step ahead (big black dot). The window length, as well as the starting points, can be parametrized.

Unfortunately, after having spent a lot of time understanding the function and structure of Keras' TimeseriesGenerator, I found out that it supposedly only works for single-step predictions and not for multi-step & multi-output predictions (like several feature values for several days in the future). In those cases I would have to write my own functions to create timeseries windows. However, for the first edition of this project I will stay with a one-timestep prediction of the close_adj value and use the TimeseriesGenerator as described.
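The windowing logic can be reproduced in plain NumPy, which also makes the parameters easier to reason about; to my understanding, this sketch matches what TimeseriesGenerator yields with its default sampling_rate and stride (the function name and the synthetic data are my own):

```python
import numpy as np

def make_windows(data, targets, win_length):
    """For each target at index i, the input window covers the win_length
    previous timesteps [i - win_length, i) -- one step ahead prediction."""
    X, y = [], []
    for i in range(win_length, len(data)):
        X.append(data[i - win_length:i])
        y.append(targets[i])
    return np.array(X), np.array(y)

# 100 timesteps with 8 features; target = next-day close_adj (column 0)
data = np.arange(100 * 8, dtype=float).reshape(100, 8)
X, y = make_windows(data, data[:, 0], win_length=20)
print(X.shape, y.shape)  # (80, 20, 8) (80,)
```

Extending make_windows to return several future steps (or several target columns) per window is exactly the custom code that multi-step & multi-output prediction would require.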

Predicting one value one step in the future with multivariate input

The following code splits the historical stock data, including some indicators, into a train and a test data set. For both data sets, the TimeseriesGenerator creates a set of windows for model training and validation. I will use 8 features as input over a timespan of 20 workdays (approximately a month) to predict the close_adj value of the next day.

I wrote a function to save the "latest" model. It overwrites the last model but also creates a copy in a backup folder.

In case the training doesn't work as planned, or to save time, the latest model can be loaded with the following command.

Backtesting: pick a random chunk of history data (from the test data set, to avoid bias) in the corresponding win_length format and compare the prediction to the true value

In order to visualize what the model predicts, I wrote some code that applies the model to a random timeseries window of the test data set and calculates an error for the difference between the predicted and the actual future close_adj price. This code snippet can be repeated manually for a better understanding. In a future edition of this project, I will use the code in a loop to calculate and compare the errors of different model parametrizations.
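The backtesting step might look like this sketch, where a naive persistence forecast stands in for the trained LSTM so the snippet runs without TensorFlow (all names, shapes and the synthetic test arrays are assumptions for illustration):

```python
import numpy as np

def backtest_once(model_predict, test_X, test_y, rng):
    """Pick one random window from the test set, predict the next close_adj
    and return the absolute error against the true value."""
    i = rng.integers(len(test_X))
    pred = model_predict(test_X[i][np.newaxis, ...])  # keep the batch dimension
    return abs(float(pred) - float(test_y[i]))

# Stand-in for model.predict: persistence forecast (last seen close_adj,
# assumed to be feature column 0) -- the real LSTM would be plugged in here.
naive_predict = lambda window: window[0, -1, 0]

rng = np.random.default_rng(42)
test_X = np.random.default_rng(0).normal(size=(30, 20, 8))  # 30 windows
test_y = test_X[:, -1, 0] + 0.1  # true value = last close_adj + fixed drift
err = backtest_once(naive_predict, test_X, test_y, rng)
print(round(err, 2))  # 0.1 by construction
```

Wrapping backtest_once in a loop over many random windows (or over the whole test set) yields the error distribution needed to compare different model parametrizations.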

I don't understand why the shape of the prediction on the test_generator delivers 180 values per timeseries window! (see chapter "Questions to the Reviewer")

Thoughts/Findings on the Machine Learning Results

Improving the Model (not part of this version due to lack of time)

Using GridSearch or XGBoost Parameter Optimization
Data Preparation

Communicating Results

Saving data to server/database

Web App

I created my own homepage with the help of a free Bootstrap template and published it on the web server of my personal Synology NAS, which unfortunately does not provide an easy way to work with the Apache backend server. Maybe in the future I will deploy the web app on a platform like Heroku with Flask, but for now it is more convenient to execute the stock analysis separately and only publish its results on my homepage:

http://kalinka.synology.me

Structure without Backend:

Structure for Backend Part (not implemented yet - difficulties with apache server configuration on personal NAS)

Visualizations

Finance Chart
Table with the values of the recent buy/sell/keep decision criteria

Report

Send Alert/Info Message via Mail

I wrote a small function that (with the help of a Google API key) sends mails programmatically. Once finished, it will be used to inform or alert me when my server detects a strong buy or sell recommendation (according to my definitions above).
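A sketch of such an alert mail, here using the standard library's smtplib with a Gmail app password instead of the Google API approach (addresses, symbol and function names are made up; composing is kept separate from sending so the message can be inspected without network access):

```python
import smtplib
from email.message import EmailMessage

def build_alert(sender: str, recipient: str, symbol: str, action: str) -> EmailMessage:
    """Compose the alert mail; sending is a separate step."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Stock alert: {action.upper()} {symbol}"
    msg.set_content(
        f"The analysis triggered a strong {action} recommendation for {symbol}."
    )
    return msg

def send_alert(msg: EmailMessage, password: str) -> None:
    # Assumes a Gmail app password (smtp.gmail.com, port 465, SSL)
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(msg["From"], password)
        server.send_message(msg)

msg = build_alert("me@gmail.com", "me@gmail.com", "BMW.DE", "buy")
print(msg["Subject"])  # Stock alert: BUY BMW.DE
```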

Use Case:

Conclusion

Even though my work on the project is not finished yet, this will be it for the first blog post. There is still a lot to improve and not all goals have been achieved, but here is what I learned with respect to each requirement from the outline at the beginning:

Future Prospects

During the project I continuously developed more ideas about what I could do in the future to improve the functionality and accuracy of the data. Here are some of them:

    - Portfolio Completion
      - include an algorithm that iterates through stocks in a specific field/industry and identifies the ones with promising performance or a price lower than their value
    - MVO (Mean Variance Optimization): design the portfolio to reach a target return with minimal risk (using covariance, e.g. negative correlation between stocks)
      - identify the efficient frontier (the best portfolio weights in terms of return/risk)
    - Try other machine learning models and combine their results (inspect TFTs)
    - Estimate confidence in order to evaluate predictions
    - Use sentiment analysis to estimate impacts on the stock markets that are hard to find with technical analysis

Questions to the Reviewer

Project:

  1. Are there state-of-the-art stock prediction and/or timeseries forecasting methods/strategies/models/classifiers that I haven't mentioned in this project and that I could look into?
  2. Is there a recommended (free) solution for SMS/WhatsApp messaging with Python?
  3. Is the common use case to train a model as well as possible, save it, and then load and use it every day to predict the next days? Or is the goal to retrain the model every day (with the newest data) and then predict the next days?
    • I guess it's going to be a compromise: saving the weights, using the model to predict for a few days, and then improving it by training it further with new data after some time?!
  4. In the Keras documentation I found the "WindowGenerator"... Is this function now deprecated and equivalent to Keras' "TimeseriesGenerator", or why can't I find more information on WindowGenerator anywhere? Why wouldn't they update their tutorial? https://www.tensorflow.org/tutorials/structured_data/time_series
  5. I did not understand the batch_size parameter of the TimeseriesGenerator. The number of timeseries windows generated is len(timeseriesgeneratorobject) and the window length (timesteps per window) is the parameter "length", so what is the meaning of batch_size? Apparently increasing the batch_size shortens the training time a lot and reduces the number of windows...
  6. When I predict on the test_generator object, I receive an array of the shape (number of windows, window length, 1), even though I thought the model would only predict one single value per window (not win_length predictions). What am I doing wrong?
  7. How could I implement my personal definition of error in the model training process above?
    • (instead of model.compile(loss = tf.losses.MeanSquaredError(), optimizer = tf.optimizers.Adam(), metrics = [tf.metrics.MeanAbsoluteError()]))

General:

  1. Is there any way I can get some statistics about my studying on the Udacity platform this year?
    • For example: Time spent on the platform across all courses (and separately for each course).

Thanks for the review in advance!

Sources & Credits